
Extract content links from markdown nodes#98

Merged
mindsocket merged 2 commits into main from issue-74-content-links
Apr 21, 2026

Conversation

@mindsocket
Owner

Closes #74

Summary

  • Adds a ContentLink type representing links extracted from node body and frontmatter (wikilinks, markdown links, images, bare URLs)
  • New extract-content-links.ts module handles parsing via remark AST for body content and regex scanning for frontmatter string values
  • Graph edge fields (hierarchy + relationship fields) are excluded from frontmatter extraction to avoid duplicating structural edges
  • Links are resolved against the space to classify as space-internal, space-external, or external URLs, and exposed on SpaceNode as contentLinks
  • Validation rules can now target contentLinks fields; schema and validate modules updated accordingly

Test plan

  • bun run test — unit tests pass including new tests/read/content-links.test.ts
  • bun run typecheck — no type errors
  • Verify contentLinks is populated on nodes when dumping a space with links in content

Adds a ContentLink type and extracts wikilinks, markdown links, images,
and bare URLs from node body and frontmatter fields. Links are resolved
to indicate whether they are space-internal, space-external, or external
URLs, and exposed on SpaceNode as contentLinks. Graph edge fields are
excluded from frontmatter link extraction to avoid duplication.
@claude

claude Bot commented Apr 21, 2026

Claude finished @mindsocket's task in 2m 27s


Code Review

Tasks

  • Gather context and understand the implementation
  • Review core extraction logic
  • Review type definitions and API changes
  • Review test coverage
  • Run test suite (✅ 430 tests passed)
  • Run type checking (✅ no errors)
  • Provide comprehensive feedback

Overall Assessment

This is a well-designed and carefully implemented feature that adds content link extraction capabilities to the structured-context system. The implementation follows existing patterns, has comprehensive test coverage, and properly handles edge cases like excluding graph edge fields from extraction.

Strengths:

  • Clean separation of concerns with a dedicated extract-content-links.ts module
  • Comprehensive test coverage (347 lines of new tests in content-links.test.ts)
  • Proper type definitions with clear documentation
  • Smart exclusion of graph edge fields to avoid duplicating structural edges
  • Integration with both frontmatter and body content parsing
  • Proper link classification (node, internal, external, system, protocol)

Detailed Review by Component

1. Core Extraction Logic (extract-content-links.ts)

✅ Well-implemented:

  • parseWikilinkInner correctly handles aliases and anchors (lines 12-32)
  • extractLinksFromAstNode properly walks the mdast tree (lines 108-152)
  • getEdgeFieldNames correctly identifies hierarchy and relationship fields (lines 167-177)
  • extractLinksFromFrontmatter properly excludes edge fields (lines 183-198)
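As a rough illustration of the alias/anchor handling described above, a wikilink inner-text parser might look like the following sketch. The function name matches the review; the body is an assumption for illustration, not the project's actual code.

```typescript
// Hypothetical sketch — splits "[[target#anchor|alias]]" inner text into parts.
type ParsedWikilink = { target: string; anchor?: string; text: string };

function parseWikilinkInner(inner: string): ParsedWikilink {
  // Everything after the first '|' is the display alias.
  const pipe = inner.indexOf('|');
  const alias = pipe >= 0 ? inner.slice(pipe + 1).trim() : undefined;
  const path = (pipe >= 0 ? inner.slice(0, pipe) : inner).trim();

  // Everything after the first '#' is the anchor (heading or ^block ref).
  const hash = path.indexOf('#');
  const target = hash >= 0 ? path.slice(0, hash) : path;
  const anchor = hash >= 0 ? path.slice(hash + 1) : undefined;

  return { target, ...(anchor ? { anchor } : {}), text: alias ?? path };
}

parseWikilinkInner('notes/foo#intro|Foo');
// → { target: 'notes/foo', anchor: 'intro', text: 'Foo' }
```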

⚠️ Minor concerns:

  1. Regex portability issue (line 66):

    // Negative lookbehind: skip URLs immediately preceded by `](` (already a markdown link target)
    for (const match of text.matchAll(/(?<!\]\()https?:\/\/[^\s\])"<>]+/g)) {

    The negative lookbehind (?<!\]\() is not supported in all JavaScript environments (notably Safari before 16.4). While this likely works in the project's target environment, consider an alternative approach if broader compatibility is needed:

    Suggestion:

    // Alternative: post-process to filter out URLs already in markdown links
    const allUrls = text.matchAll(/https?:\/\/[^\s\])"<>]+/g);
    const markdownLinkUrls = text.matchAll(/\[([^\]]*)\]\((https?:\/\/[^)]+)\)/g);
    const markdownUrlSet = new Set(Array.from(markdownLinkUrls, m => m[2]));
    for (const match of allUrls) {
      if (!markdownUrlSet.has(match[0])) {
        links.push({ text: match[0], target: match[0], action: 'link', linkSyntax: 'markdown' });
      }
    }
  2. URL pattern completeness (line 66):
    The regex /https?:\/\/[^\s\])"<>]+/ excludes ), ], ", <, > from URLs. This is reasonable for markdown contexts but might miss valid URLs containing these characters (e.g., Wikipedia URLs with parentheses). This is acceptable for the current use case but worth noting.
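To make that truncation concrete, here is a small standalone demonstration of how the character class clips a URL containing parentheses. The pattern is copied from the review; the sample text is invented.

```typescript
// The bare-URL pattern quoted above; ')' is in the excluded character class.
const bareUrlPattern = /https?:\/\/[^\s\])"<>]+/g;

const sample = 'See https://en.wikipedia.org/wiki/Bracket_(mathematics) for details.';
const [hit] = sample.match(bareUrlPattern) ?? [];

console.log(hit);
// → 'https://en.wikipedia.org/wiki/Bracket_(mathematics' — the closing ')' is clipped
```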

2. Type Definitions (types.ts)

✅ Excellent:

  • ContentLink type is well-documented with clear field descriptions (lines 74-85)
  • ResolvedContentLink properly extends ContentLink with location classification (lines 91-109)
  • BaseNode and SpaceNode types are properly extended with contentLinks and resolvedLinks (lines 115-151)

📝 Documentation suggestion:
The ResolvedContentLink.location field documents 5 possible values but doesn't explicitly mention that 'node' is for wikilinks resolved to known space nodes. The comment is there, but could be more prominent (lines 101-107).

3. Graph Resolution (resolve-graph-edges.ts)

✅ Well-implemented:

  • classifyContentLink correctly handles both wikilink and markdown link classification (lines 156-177)
  • Proper integration with existing resolveGraphEdges function (lines 276-279)
  • Wikilink resolution correctly handles anchors (lines 162-164)

✅ Smart design:
The decision to check both target#anchor and bare target in the index (line 163) is clever - it allows block references to resolve even if the node doesn't explicitly list all block anchors in linkTargets.
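The dual lookup could be sketched as follows; the function and index shapes here are assumptions for illustration, not the project's actual API.

```typescript
// Try the exact "target#anchor" key first, then fall back to the bare target,
// so block references resolve even when anchors aren't listed individually.
function resolvesToNode(index: Set<string>, target: string, anchor?: string): boolean {
  if (anchor && index.has(`${target}#${anchor}`)) return true;
  return index.has(target);
}

const index = new Set(['notes/foo', 'notes/bar#intro']);
resolvesToNode(index, 'notes/foo', '^someblock'); // true — bare target is a known node
resolvesToNode(index, 'notes/bar', 'intro');      // true — exact anchored key exists
resolvesToNode(index, 'notes/baz');               // false
```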

4. Validation Integration (validate.ts)

✅ Proper implementation:

  • Unresolved content links are collected as warnings, not errors (lines 239-251)
  • Deduplication prevents duplicate warnings (line 244)
  • Integration with validateFile properly separates errors from warnings (lines 378-392)

✅ User-friendly output:
The validation output in validate.ts (lines 104 and 113-124) clearly shows unresolved content links as warnings rather than errors.

5. Test Coverage (content-links.test.ts)

✅ Comprehensive:

  • Unit tests for all extraction functions (lines 31-142)
  • Integration tests for end-to-end link classification (lines 148-235)
  • Edge cases covered: anchors, aliases, embeds, protocols, relative paths (lines 170-219)
  • Proper verification that edge fields are excluded (lines 221-225)

✅ Well-structured:
Tests are organized logically by function and include clear assertions.

6. API Changes (api.ts)

✅ Appropriate exports:

  • ContentLink and ResolvedContentLink are exported for public use (lines 29, 36)
  • Consistent with existing API patterns

7. CLI Output Changes

✅ Good UX:

  • Validation now shows warnings separately from errors (validate-file.ts, lines 29-41)
  • Warnings are displayed in yellow to distinguish from errors (line 12)
  • Unresolved content links are properly categorized as warnings (validate.ts, line 104)

Integration Points

✅ Markdown plugin integration:

  • read-space.ts properly extracts links during parsing (line 142)
  • parse-embedded.ts extracts links from embedded nodes (lines 835-836, 849-850)

✅ Schema integration:

  • schemas.ts change to use resolveRef for descriptions (lines 38-39) is an improvement but seems unrelated to this feature. Consider if this should be a separate commit.

Potential Issues & Suggestions

  1. Rule evaluation access:
    The resolvedLinks field is not currently exposed to rule evaluation in evaluate-rule.ts's flattenNode function. This means rules cannot directly query content links. If this is intentional, consider documenting it. If not, this could be a future enhancement.

  2. Performance consideration:
    For large spaces with many nodes and links, the link extraction and resolution could impact performance. Consider if any caching or optimization would be beneficial for very large spaces (1000+ nodes).

  3. Documentation updates:
    While docs/concepts.md was updated, consider adding examples of how to use contentLinks in validation rules once rule evaluation access is added.
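If rule access were added later, exposing resolvedLinks on the flattened node might look something like the sketch below. flattenNode's real shape is not shown in this PR, so the types here are assumptions.

```typescript
type ResolvedLink = { target: string; location: string };
type RawNode = { frontmatter: Record<string, unknown>; resolvedLinks?: ResolvedLink[] };

// Merge frontmatter fields with resolvedLinks so rule expressions can
// reference both uniformly on the flattened node.
function flattenNode(node: RawNode) {
  return { ...node.frontmatter, resolvedLinks: node.resolvedLinks ?? [] };
}

const flat = flattenNode({
  frontmatter: { title: 'Example' },
  resolvedLinks: [{ target: 'https://example.com', location: 'external' }],
});

// A JSONata rule could then query, e.g.: $count(resolvedLinks[location='external']) > 0
const hasExternal = flat.resolvedLinks.some((l) => l.location === 'external');
```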


Verification Results

Test Suite: ✅ All 430 tests passed
Type Checking: ✅ No type errors
Code Quality: ✅ Follows existing patterns and conventions


Summary

This is a high-quality implementation that is ready to merge. The code is well-tested, properly typed, and thoughtfully designed. The minor concerns about regex portability are not blocking issues given the project's likely target environment.

Recommendation: Approve with minor suggestions

The implementation successfully achieves its goals:

  • ✅ Extracts links from both body and frontmatter
  • ✅ Properly classifies links by location
  • ✅ Excludes graph edge fields to avoid duplication
  • ✅ Integrates cleanly with existing validation system
  • ✅ Provides comprehensive test coverage


@mindsocket mindsocket left a comment


There are code simplification and consistency issues to address before this can be merged.

for (const [key, value] of Object.entries(data)) {
  if (edgeFields.has(key)) continue;
  if (typeof value === 'string') {
    links.push(...extractLinksFromText(value));


Couldn't this use mdast to parse the value as markdown and walk the content for link types? That keeps link parsing consistent across frontmatter and content.
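For illustration, walking an mdast tree for link-bearing node types might look like the following sketch. The node shapes follow mdast's link/image conventions, and the hand-built tree stands in for remark's parse output.

```typescript
type MdNode = { type: string; url?: string; children?: MdNode[] };
type FoundLink = { target: string; action: 'link' | 'embed' };

// Recursively collect link and image nodes from an mdast-style tree.
function collectLinks(node: MdNode, out: FoundLink[] = []): FoundLink[] {
  if (node.type === 'link' && node.url) out.push({ target: node.url, action: 'link' });
  if (node.type === 'image' && node.url) out.push({ target: node.url, action: 'embed' });
  for (const child of node.children ?? []) collectLinks(child, out);
  return out;
}

const tree: MdNode = {
  type: 'root',
  children: [{
    type: 'paragraph',
    children: [
      { type: 'link', url: 'https://example.com' },
      { type: 'image', url: './diagram.png' },
    ],
  }],
};

collectLinks(tree);
// → [{ target: 'https://example.com', action: 'link' }, { target: './diagram.png', action: 'embed' }]
```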

Comment thread: src/types.ts (Outdated)
 * A fully classified link from node content, after location resolution.
 * Present on SpaceNode after graph resolution.
 */
export type ResolvedContentLink = {


These types should be composed from a base type since they're largely the same.

- Extract ContentLinkBase to compose ContentLink and ResolvedContentLink
  from shared fields rather than duplicating them
- Use mdast (extractLinksFromBody) for frontmatter string values instead
  of separate regex helpers, keeping link extraction consistent across
  body and frontmatter
- Expose resolvedLinks on flattened nodes so JSONata rules can query
  content links
- Remove extractLinksFromText and its regex helper functions
@claude

claude Bot commented Apr 21, 2026

Claude finished @mindsocket's task in 2m 22s


Code Review

Tasks

  • Gather context and understand the implementation
  • Review frontmatter link extraction approach
  • Review type composition
  • Verify test suite passes
  • Provide review findings

Review Summary

The PR author's review comments have already been fully addressed in commit 4303ceb ("Address PR #98 review: type composition, consistent parsing, rule access"). Both concerns raised have been resolved:

✅ Issue 1: Frontmatter Link Extraction

Review comment: "Couldn't this use mdast to parse the value as markdown and walk the content for link types? That keeps link parsing consistent across frontmatter and content."

Status: Fixed

The initial implementation used regex-based parsing in extractLinksFromFrontmatter() (commit cca0c25). The current version now uses mdast parsing for consistency:

Current implementation (extract-content-links.ts:139-154):

export function extractLinksFromFrontmatter(data: Record<string, unknown>, edgeFields: Set<string>): ContentLink[] {
  const links: ContentLink[] = [];
  for (const [key, value] of Object.entries(data)) {
    if (edgeFields.has(key)) continue;
    if (typeof value === 'string') {
      links.push(...extractLinksFromBody(value));  // ✅ Uses mdast parsing
    } else if (Array.isArray(value)) {
      for (const item of value) {
        if (typeof item === 'string') {
          links.push(...extractLinksFromBody(item));  // ✅ Uses mdast parsing
        }
      }
    }
  }
  return links;
}

The extractLinksFromBody() function (lines 113-116) uses the unified processor with remark:

export function extractLinksFromBody(body: string): ContentLink[] {
  const tree = unified().use(remarkParse).use(remarkGfm).parse(body) as Root;
  return extractLinksFromAstNode(tree);
}

This ensures frontmatter and body content are parsed consistently using the same mdast-based approach.


✅ Issue 2: Type Composition

Review comment: "These types should be composed from a base type since they're largely the same."

Status: Fixed

The initial implementation duplicated all shared fields between ContentLink and ResolvedContentLink. The current version uses proper type composition:

Current implementation (types.ts:70-107):

/**
 * Shared fields for all content link variants.
 */
type ContentLinkBase = {
  /** Display text of the link (text content for markdown links, inner text or target for wikilinks). */
  text: string;
  /** Raw link target: URL for markdown links, wikilink path (without [[ ]]) for wikilinks. */
  target: string;
  /** Whether the link navigates to a resource or transcludes/embeds it. */
  action: 'link' | 'embed';
  /** Anchor fragment (heading or block ref, e.g. 'heading-title' or '^blockid') if present. */
  anchor?: string;
};

/**
 * A link extracted from node content at parse time, before location classification.
 * Populated by plugins that support content link extraction (e.g. the markdown plugin).
 */
export type ContentLink = ContentLinkBase & {
  /** Link syntax — used during resolution to determine how to classify the target. */
  linkSyntax: 'wikilink' | 'markdown';
};

/**
 * A fully classified link from node content, after location resolution.
 * Present on SpaceNode after graph resolution.
 */
export type ResolvedContentLink = ContentLinkBase & {
  /**
   * Resolved location classification:
   * - 'node'     — wikilink resolved to a known space node
   * - 'internal' — wikilink or relative path, in the source system but not a space node
   * - 'external' — http/https URL
   * - 'system'   — file:// URL
   * - 'protocol' — other scheme (e.g. obsidian://)
   */
  location: 'node' | 'internal' | 'external' | 'system' | 'protocol';
};

This is a clean, maintainable approach that:

  • Eliminates duplication of shared fields (text, target, action, anchor)
  • Makes the relationship between the types explicit
  • Follows TypeScript best practices for type composition
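A location classifier consistent with the five values documented above might be sketched like this; the name echoes classifyContentLink from the first review, but the body is an assumption, not the project's actual code.

```typescript
type Location = 'node' | 'internal' | 'external' | 'system' | 'protocol';

function classifyLocation(
  target: string,
  linkSyntax: 'wikilink' | 'markdown',
  knownNodes: Set<string>,
): Location {
  if (linkSyntax === 'wikilink') {
    // Wikilinks either resolve to a known space node or stay source-internal.
    return knownNodes.has(target) ? 'node' : 'internal';
  }
  if (/^https?:\/\//.test(target)) return 'external';
  if (target.startsWith('file://')) return 'system';
  if (/^[a-z][a-z0-9+.-]*:\/\//i.test(target)) return 'protocol'; // e.g. obsidian://
  return 'internal'; // relative path within the source system
}
```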

Verification

Test Suite: ✅ All 423 tests passed
Type Checking: ✅ No errors (previous review confirmed)
Code Quality: ✅ Follows existing patterns and conventions


Conclusion

Both review concerns have been successfully addressed. The implementation now:

  • ✅ Uses consistent mdast-based parsing for both frontmatter and body content
  • ✅ Uses proper type composition with ContentLinkBase to eliminate duplication
  • ✅ Maintains comprehensive test coverage
  • ✅ Follows TypeScript and codebase best practices

Recommendation: Ready to merge

The PR successfully implements content link extraction with clean, maintainable code that addresses all review feedback.

@mindsocket mindsocket merged commit c2d4d5f into main Apr 21, 2026
2 checks passed
@mindsocket mindsocket deleted the issue-74-content-links branch April 21, 2026 07:46


Development

Successfully merging this pull request may close these issues.

Parse regular links in content as non-graph nodes
